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Abstract 

This paper is concerned with statistical methods for the analysis of lin- 
ear sequence data using Hidden Markov Models (HMMs) where the task 
is to segment and classify the data according to the underlying hidden 
state sequence. Such analysis is commonplace in the empirical sciences 
including genomics, finance and speech processing. In particular, we are 
interested in answering the question: given data y and a statistical model 
iT{x,y) of the hidden states x, what shall we report as the prediction x 
under 7r(a::|y)? That is, how should you make a prediction of the underly- 
ing states? We demonstrate that traditional approaches such as reporting 
the most probable state sequence or most probable set of marginal predic- 
tions leads, in almost all cases, to sub-optimal performance. We propose 
a decision theoretic approach using a novel class of Markov loss functions 
and report x via the principle of minimum expected loss. We demon- 
strate that the sequence of minimum expected loss under the Markov loss 
function can be enumerated using dynamic programming methods and 
that it offers substantial improvements and flexibility over existing tech- 
niques. The result is generic and applicable to any probabilistic model on 
a sequence, such as change point or product partition models. 



1 Introduction 
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This paper is concerned with statistical methods for the analysis of linear se- 
quence data using Hidden Markov Models (HMMs) where the task is to segment 
and classify the data according to the underlying hidden state sequ ence. Such 
analysis is commonplace in the empir i cal sc iences incl uding genomics iDav et al. 



2007t iMaioros et al 
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Crowder et al. . 2005 : Rossi a.iid Gallo.l2006t Banachewicz et al. , 2008,1 and sp eech 



Chien and Furuil . l2005t lYan et all . l2007t IWeiss and Ellisi [2008f . In 



processing 

particular, we are interested in answering the question: given data y and a 

statistical model 7r(x,2/) of the hidden states x, what shall we report as the 

prediction x? Furthermore, if we are interested in departures from a particular 

null state, how can we calibrate the predictions such that they achieve given 

frequentist performance measures such as the rate of false positive classification 

errors. 
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Figure 1: SNP geno typing data set. A SNP geno typing data comprises two 
sets of measurements - the Log R Ratio and the B allele frequency - measured 
at multiple locations along the genome. Alterations in the distributions of the 
measurements correspond to underlying changes in the DNA copy number. Each 
coloured region corresponds to a different underlying DNA copy number state. 



In this paper we formalise the problem within a Bayesian decision theo- 
retic framework. Traditional approaches, such as reporting the most probable 
state sequence or the most probable set of marginal predictions, is equivalent to 
assuming particular loss functions that maybe inappropriate for segmental anal- 
ysis of linear sequence data and can lead to unfavourable outcomes. Following 
this, we propose a new class of Markov loss function that penalises the misclas- 
sification of state occupancy and transitions which are errors of direct relevance 
in many segmental classification problems. Under the Markov loss function, the 
state sequence with minimum expected loss (or maximum expected utility) can 
be enumerated using dynamic programming methods and can provide a sim- 
ple, yet effective means of extending many existing statistical models of linear 
sequence data. 

As a motivating example throughout the paper, we shall consider the prob- 
lem of identifying from DNA copy number variation using array comparative 
genomic hyb ridisation (aCGH) or single nucleotide polymorphism (SNP) geno- 
typing data [Redon et all 120061: IColella et all 120071: IScherer et all |2007| . Al- 
though all we present here is generic to any segmentation problem involving 
linear sequence data. 

Copy number variants (CNVs) are segments of DNA that are > Ikb in length 
and occur at variable copy number relative to a reference genome. In humans, 
we typically possess two copies of every gene, one inherited from each of our 
parents. However, in genomic regions containing copy number variation, it is 
possible to have less than two copies, in which case that region is then said to 
harbour a copy number loss or deletion, or more than two copies, where the 
region is then said to contain a duplication. In rare genetic disorders, whole or 



partial copies of entire chromosomes can be lost or gained; for example, Down's 

Syndrome is caused by the gain of an extra co py of chromo some 21 . 

M ore recently, numerous studies [Sebat et al .. 2004; Rodo n et allbOOalEgan et al. 



20071 : Emerson et al.l . |2008J | have shown the existence of a complex landscape 
of sub-micro scopic copy number variation that can influence gene ex pression 



Stranger et al.. 2007] and potential ly be c ausal factors in genetic causes [Pelham et al. 



20061 : ISebat et al.l . 120071 : IXu et al.l |2008{ . These studies have used microarray 



technology, such as aCGH or SNP genotyping arrays, to measure the DNA copy 
number at hundreds of thousands to millions of locations along the genome. 

As an illustration. Figure [1] depicts a SNP genotyping dataset that mea- 
sures variation in DNA copy number along a particular chromosome from DNA 
derived from tumour cells. The statistical problem is to divide the sequence 
up into regions and classify each region by the underlying DNA copy number. 
Many Hidden M arkov model based meth ods have been d eveloped for this task 
see for example IZhao et al.l 120041: Mari oni et al.l l2006l: IColclla et al. 12007" 



Wang et al.l |2007| : iKorbeTet al.l |2n07J : ICahan et all |2008| : lGuha et al.l |2008| 'l 



but the copy number predictions reported are typically those of the most proba- 
ble state sequence found using the Viterbi algorithm or the most probable set of 
marginal predictions using the forward-backward algorithm. We will show that 
for applications, such as copy number calling, standard approaches for report- 
ing the hidden state sequence are not always entirely applicable to the problem 
and that a decision theoretic approach using Markov Loss functions maybe of 
greater utility and provide increased fiexibilty. 

2 Decision Theory 

To begin we shall define some notation, let Xi € {0, . . . , 5*} denote the true un- 
observed underlying state at the i = 1, . . . , n locations, and j/i the corresponding 
observation. The task is to obtain a prediction x = {xi, . . . , Xn} given a statis- 
tical model Tr(x\y) (for notational simplicity, we shall suppress the conditioning 
on y in the following and refer to T:{x\y) as 7r(a;)). In this paper, we focus on the 
case where x forms an unobserved discrete Markov chain, however, we stress 
that our approach is generic and not dependent on this Markov condition. 
Decision theory Bergen . Il985t iBernardo and Smithl . |2000| provides an ax- 



iomatic framework for making optimal decisions via the principle of minimum 
expected loss (or maximum expected utility). In our problem the "decision" 
is the reporting of x from which a set of actions will be taken with associated 
losses based on the unknown true state of nature x. We encapsulate the forms 
of error into a loss function l{x\x) which quantifies the loss of taking actions on 
X when the true state of nature is x. The principle of minimum expected loss 
(MEL) prescribes one should report x as 

X = 'Aigmin E^^^)[l{x\x)], 

X 

— argminN /(a;|a;)7r(a;). 

X -*- — ^ 

X 

For example, one potential loss function is to use 

l{x\x) ^ / 0' if ^ = 2^ Q-) 

^ ^ ' 11, otherwise 



which leads to the reporting of the most probable joint state sequence x = 
argmaxa; 7r(x). For Hidden Markov Models, the mos t probab l e join t state se- 
quence can be computed using the Viterbi algorithm [Rabinerl . Il989{ . We shall 
refer to this as the global loss function as a constant penalty is incurred if the 
prediction is not completely correct. Alternately if we choose 

i{x\x) = y^ i{xi\xi) 



with 



^(^^|^^)-1 ?; It^^ (2) 



then the MEL is the set of marginal predictions Xi = argmax^;; ^^ 7r({xi, X-i}) 
where the summation is over X-i, the state sequence other than Xi. We shall 
refer to this as the marginal loss function as it considers each location indepen- 
dently of the others. With Hidden Markov models, the marginal probabilities 
can be calculated exactly using the forward-backward algorithm iRabincr, . il989| . 

3 Motivation 

Consider the simulated data sequence in Figure [H^a) for a two-state Hidden 
Markov model which contains a region of elevated signal related to an underly- 
ing change in the hidden state. Figure [2Ib-e) shows the true underlying state 
sequence and four different predictions of this state sequence. All of the predic- 
tions contain the same number of misclassifications; for example, Figure [UJc-e) 
contain the same number of false negatives (if we assume state to be a null 
state such that the notion of a false negative applies). It is apparent though 
that the predicted sequences are qualitatively very different and could lead to 
quite different actions if decisions are taken upon them. For example. Figure 
[5Jd) contains three non-null segments whilst Figure [21[e) contain two segments. 
It is clear therefore that simply counting the number of state mis-classifications 
is insufhcient. 

3.1 Problems with global and marginal loss functions 

If we consider the global loss function in ((T)) , this loss function assigns a penalty 
of one if the predicted sequence x is not exactly identical to the true sequence x. 
This loss function is extreme in the sense that no matter how many classification 
errors are made, the same penalty is incurred - it is an all-or-nothing approach. 
Furthermore, for this loss function the entirety of the sequence is important, 
the optimal prediction must be globally and locally correct. In contrast, the 
marginal loss function, defined in 1^, ignores any form of local or global struc- 
ture. It concentrates instead on penalising classification error at each location 
considered independently of others which is equivalent to stating that the overall 
loss is invariant to permutations of the sequence {xi,Xi}"^i. It is apparent that 
these two commonly used loss functions correspond to quite opposite extremes 
and neither scenario maybe appropriate in segmental classification problems. 

For example, in many situations it is rare for errors to be completely in- 
tolerable, rather there are acceptable tolerance levels for error. Under these 
circumstances it would not be appropriate to use the global loss function in 
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Figure 2: Sequence predictions. An example data set (a) and four predictions of 
the underlying state sequence (b-e). (Grey, solid) Predicted and (Black, dashed) 
true state sequence. 



which the same penalty is incurred irrespective of how many errors are made in 
the prediction. Furthermore, if we are interested in the segmental classification 
of linear sequences and we expect dependencies between states at different loca- 
tions, it does not seem appropriate to use a marginal loss function that considers 
classification error at each location independently of the others; see for example 
Figure [5] (c-e) . 

Nonetheless, the appeal of these loss functions is that the computation of 
the state sequence with minimum expected loss is often analytically tractable 
or simple to approximate with commonly used statistical models. With Hidden 
Markov Models, the Viterbi algorithm allows the most probable sequence se- 
quence to be found whilst the forward-backward algorithm allows the marginal 
probabilities 7r(xi) = ^^ . 7r(a;) to be calculated in a time which is linear in the 
length of the data sequence. 



3.2 Example application 

In the identification of DNA copy number alterations from aCGH or SNP geno- 
typing data the reporting of the most probable copy number state sequence 
or the set of most probable marginal copy number predictions is commonplace. 
However, the loss functions defined by ([Ij and © do not necessarily encapsulate 
the types of error that need to be considered in practice. 

A recent exam ple of a statistical model-based approach for CNV discov- 
ery is QuantiSNP |Colella et al.l . [2007 1. This uses a Bayesian Hidden Markov 



model framework to identify CNVs from Illumina SNP genotyping data. Quan- 



tiSNP applies the Viterbi algorithm to obtain a final copy number state se- 
quence prediction after fitting the Hidden Markov model to the SNP data using 
an expectation-maximisation algorithm. Furthermore, QuantiSNP also assigns 
a measure of uncertainty to each putative CNV by calculating the Bayes Factor 
Tr{xi;j = CNY\y)/ir{xi;j = NULL) which gives the ratio of the marginal prob- 
ability that the region, spanned by the probes i to j, contains a CNV to the 
marginal probability that the region does not contain a CNV. False positive 
CNV calls can be reduced by setting an appropriate threshold for the mini- 
mum acceptable Bayes Factor. This procedure though has a weakness in that 
it conditions on the Viterbi sequence. 

Given a statistical model, the Viterbi algorithm will return the state se- 
quence that has maximum probability and there will typically be only one such 
sequence. If an experimentalist wished to increase their power to detect CNVs 
and was willing to accept increased numbers of false positives it would not be 
possible to accommodate this as there is a presumption of a global loss func- 
tion on the underlying copy number sequence. It may be possible to adopt a 
marginal loss function instead but, as we have already highlighted, the use of 
a marginal loss function seems inappropriate as it ignores local structure by 
classifying each probe independently yet the overall goal of the algorithm is to 
discover local structure due to copy number variation. 

Other copy number detection methods, b ased on Hidde i i Mar kov models, 
such as dCh i p IZhao et al.l . |20Q4|. BioHMM iMarioni et all . l2006l. PennCNV 



Wang et all |2007| and wuHMM [Cahan et al.l . |2008| also condition on the 



Viterbi sequence and are therefore subject to the same problem. 

3.3 Markov Loss Function 

It seems apparent that a more general loss function is required that sits in 
between the extremes of the global and marginal loss functions. One which 
takes into account some notion of local sequence structure whilst still remaining 
amenable to tractable computations. 

Let us return to the predicted sequences in Figure [21 None of the four 
predictions depicted returns the true sequence and each sequence contains the 
same number of mis-classified locations; however, the predictions differ in the 
number of state transition errors and hence the number of segments reported. If 
we assume that one of the states can be considered to be a null or resting state, 
one approach, from the many possible, is to characterise four types of error in 
making the decision to switch from a state Xi = j to ii+i 7^ j: 

1. X predicts a state transition where there is none. We refer to this as a False 

Positive Transition (FPT). 

2. X misses a state transition where one actually exists. We refer to this as a 

False Negative Transition (FNT). 

3. 4. X correctly predicts that there is no state transition but calls the incorrect 

state. We refer to these as False Positive Calls (FPC) and False Negative 
Calls (FNC) respectively. 

There is also a fifth error type: 



5. X predicts a state transition in a direction that is opposite to the truth. 
We refer to these as Doubly False Transition (DFT). This is the most 
catastrophic error type when the predicted change point is structurally 
incompatible with the true changepoint. A high loss should be associated 
with this error. 

We can now begin to differentiate the sequences in Figure[2jb, c) from those 
shown in Figure [^Jd, e). Whilst Figure [21[b, c) each contain one pair of false 
positive and negative transition errors and report only one segment, those in 
Figure [H^d, e) consist of multiple transition errors and report a number of seg- 
ments. In a CNV calling application, each state transition error corresponds to 
either a incorrectly specified breakpoint or a false putative CNV call and these 
errors carry implications for the cost of follow-up studies since each putative 
CNV (segment) might require further experimental validation. 

The importance of state transition errors motivates the use of a loss function 
that assigns penalties to false state transition calls. We now propose a class of 
loss function that we will refer to as Markov Loss functions, 

l{xi^^+i\x,,i+i) = i A • (3) 

' [1, otherwise 

where Xi^i+i = {x.^, Xi+i}. In the following section, we will describe how to 
calculate the expected loss for this novel class of loss function, as well a dynamic 
programming solution to obtain the sequence that has the minimum expected 
loss under this loss function. 



4 Method 

4.1 Calculating the Expected Loss under the Markov Loss 
function 

Let a; be a state sequence, where Xi G {0, . . . , 5}, i = 1, . . . , n, and S is the total 
number of possible states. The loss function l(x\x) shall consist of a sum of 
Markov loss functions that penalises transitions in x, 

n-l 

l{x\x) = '^l{xi^i+i\xi,i+i). (4) 

The expected loss is given by, 

E„^^)[l{x\x)] = X! 1 ^Ki^,i+l\^i,^+l) ( ^i^) (5) 

X { i=l ) 

where, by exchanging the order of summation, 

ri-l ( "\ 



E^(^^)[l{x\x)] = ^l^l{x^^^+l\x^^^+l)■K{x)\ , 
1=1 la; J 

n 1 ( 



i—1 I Xi 



Table 1: Loss matrix structure for binary state transitions. 



l{x\x) 



(0,0) 



(0,1) 



(1,0) 



(1,1) 



(0,0) 
(0,1) 
(1,0) 

(1,1) 



Ctp Cfnt + Cfnc Cfnt Cfnc 

Cfpt + Cfpc Ctp Cdft + Cppc Cfpt 

Cfpt Cdft + Cfnc Ctp Cfpt + Cfnc 

Cfpc + Cppc Cfnt Cfnt + Cppc Ctp 



4.2 Dynamic Programming 

As the expected loss for the Markov loss function is additive, the prediction x 
that has MEL can be found using the following dynamic programming recursions 
(in similar fashion to the Viterbi algorithm): 

4.2.1 Forward recursion 

Compute, 

^i(fc) = min£:^(^^2)[^(*i,2 = (i,fc)ki,2)], 

3 

6i{k) = argmin£;^(a;i2)[;(.Ti^2 = (.?, fc)|a;i,2)], 

where k e {0, . . . , S*} and then for i = 2, . . . , n, 

(j)i{k) = min[0i_i(j) +L(xi,.,_i = (j, fc))], 
j 

6i{k) = argmin[(?!)i_i(j) +L(ii,i_i = (j, fc))], 

3 

where L{xi^i-i) = Y.x,a-i Kxi,i-i\xi,i-i)T^{xi^i-i). 

4.2.2 Backward trace 

Find Xn = argmiufc (f>n{k) then Xi-i = Si{xi),i — n — 1, ... ,2. 

4.3 Loss Matrix 

For binary sequences {S = 1), the loss matrix consists of 4 x 4 = 16 elements, 
each corresponding to one of the possible prediction-truth state transitions. Ta- 
ble [T] shows one design of the loss matrix which consists of six unique loss values 
Ctp, Cfpc, Cfnc, Cfpt, Cfnt and Cdft- The value of Ctp corresponds 
to the loss incurred with making the correct prediction which we normally set 
to zero, whilst the other loss values correspond to the penalties incurred for 
each of the transition errors described previously. It should be noted that the 
marginal loss function is a particular case of our Markov loss function when the 
losses associated with transition errors are zero {Cfpt — Cfnt — 0). For the 
global loss function, the loss matrix would consist of 2" x 2" entries where each 
element corresponded to a different {prediction-truth} sequence pair with zeros 
on the diagonal and all ones on the off-diagonal elements. 

In our example application of CNV calling an appropriate loss matrix design 
would assign high loss to catastrophic doubly false transition errors Cdft ^ 1 



which corresponds to predicting the start of a CNV where a CNV region is 
actually ending. The values of the other loss parameters are study-dependent. 
In a CNV application, large values of Cppc and Cfpt relative to Cfnc and 
Cent would lead to fewer misclassified probes and CNV calls as the penalities 
for false positive state classifications and transitions are severe. In contrast, 
relatively smaller values of Cppc and Cfpt would encouraged the production 
of more putative CNV calls, and therefore increase the power to detect copy 
number variants albeit at the expense of an increase in false positive CNV calls. 
One point of note is that the Markov loss function increases the number of 
free parameters that need to be specified. This additional parameterisation is 
necessary in order to obtain the flexibility that is often required in empirical 
data analysis. Careful design of the design matrix and consideration of the 
problem can minimise the number of free parameters. For example, in the loss 
matrix shown in Table [TJ there are only three free parameters Cfnt/Cfpt, 
Cfnc/Cfpc and Cfpt/Cfpc once Cdft and Ctp are given since the actual 
scale of the loss matrix does not matter. 

4.4 Computational requirements 

The order of computation required is 0{S'^N), where S is the number of hidden 
states in the Hidden Markov model and A^ is the sequence length, since a sum- 
mation is required over all possible pairs of the true hidden states x^^i+i and 
predictions Xi^i+i. This can be prohibitive for applications involving large state 
spaces but is computationally manageable for small numbers of hidden states. 
In practical situations though it is often the case that the posterior probability 
distribution assigns high probabilities to a few transitions whilst the reminder 
have negligible probability. For data exhibiting sparse properties, these fea- 
tures can be exploited in order to derive approximate alg orithms for inference 



in Hidden Markov models (see ISiddiai and Moord [2005|) that can offer sub- 



stantial computational gains at the expense of little error if the assumption of 
sparseness hold. 

4.5 Uncertainty in the statistical model 

We have assumed throughout the availability of the exact statistical model 
T:{x\y). In general, of course, it is rare in practice to have access to the ex- 
act statistical model and instead the model is known up to a form n{x, 9\y) that 
includes some unknown model parameters 6. The prediction must then satisfy. 



X — argmm 



^ L{z;x)TT{x,e\y) 

argmin > L{z\x)'K{x\y), 
xex 

argmin > L{z\x)Ti{x\y) 



dO, 



x£X 



where, in the second line, the independence of the loss function and the model 
parameters allows 9 to be integrated out of the model iT{x,9\y) and the problem 
is reduced to the same form as before. The integral required will generally 



be analytically intractable and an estimate 7r{x\y) must be used that can be 
obtained using Monte Carlo simulations, variational methods or by conditioning 
on point estimators (such as the MAP). 

5 Results 

5.1 Simulations 

We performed a simulation study to examine the properties of predictions made 
by the use of global, marginal and Markov loss functions in a generic segmental 
classification setup. 

5.1.1 Setup 

We simulated 1000 data sequences of length n — 1000 from a Hidden Markov 
Model with Gaussian observation densities, 

Tr{yi\xi,^i,a'^) = N{n.^^,a^),i = 1, . . . ,n, 
Tr{xi ^ j\xi-i = k) = T(j, fc),i = 2,...,n, 
7r(a;i) = v{xi), 

where /i = {0, 1}, ct^ — 1, the prior state vector h' — [0.5, 0.5]^ and the transition 
matrix T is given by. 



T{j,k) 



1 - Pi Pi 

P2 1-/02 



0, 


Xi = Xi, 




Cfpc, 


Xi = l,Xi 


= 


Cfnc, 


Xi = 0, Xi 


= 1 



with pi = 0.01, /92 — 0.05. With probability r] — 0.01 we added outliers gener- 
ated from a Gaussian distribution with mean and variance 3^. 

For each simulated dataset, we used the Viterbi algorithm to find the most 
probable state sequence x^ and the forward-backward algorithm to obtain the 
marginal state and switching probabilities 'K{xi\y) and 7r(xi_i+i \y). Using a more 
generalised form of the marginal loss function, 



i\Xi\Xij — 



in which the ratio Cppc /Cfnc allows us to control the Type I/II location- 
wise classification errors. We also obtained marginal loss predictions x^. with 
Cfnc — 1 and values of Cfpc from 0.1 to 10. 

Sequence predictions Xmi were also obtained using the Markov loss function 
where we set Cdft ~ 1000 and explored a range of values for the ratios Cfpt 
and Cfpc from 0.1 to 10 with Cfnc = Cfnt = 1- 

5.1.2 Example simulated dataset 

Figure EJa) shows one of the simulated data sets which contains a number of 
segmental alterations. In Figure [S^b, c), the Viterbi and Marginal Loss pre- 
dictions, with unit losses, produce similar state sequence predictions. However, 
decreasing the cost of false positive calls allows the marginal loss prediction to 
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Figure 3: Analysis of simulated dataset. (a) Data containing jumps in sig- 
nal intensity corresponding to underlying changes in the hidden state sequence 
(black). Predicted state sequences (grey, o) using the (b) Viterbi algorithm, 
(c) marginal loss function with Cppc = Cfnc = 1, (d) Markov loss function 
with Cfpt = CpNT — Cppc = CpNC = 1) (e) marginal loss function with 
Cppc = 0.1 and (f) Markov loss function with Cppc = 0.1. Arrows indicate 
false positive segments which are singletons or short segments. The average 
false positive call rate for (e) and (f) are the same. 



recover more of the underlying state transitions at the cost of more false posi- 
tives as shown in Figure [31[e). These false positives consist of multiple singletons 
or short segments that span just a few locations and are generally undesirable 
in segmental classification. Figure |31Jd, f) shows the predicted state sequences 
using the Markov loss function. As the cost of false positive calls is reduced, 
the predictions are able to capture more of the underlying true state sequence 
without producing singletons or short segments. Predictions made under the 
marginal loss function can be expected to contain such features can be expected 
since local sequence structure is not considered whereas the Markov loss function 
explicitly considers state transition errors and avoids generating such features. 

5.1.3 Average performance using the global and marginal loss func- 
tions 

Figure 2] shows overall classification performance, in terms of average rates of 
false positive and negative calls and transitions, on the simulated datasets using 
the Viterbi and marginal loss predictions. Since there are no free parameters 
associated with a global loss function, the performance using the Viterbi pre- 
diction is fixed and cannot be varied. In contrast, the marginal loss predictions 
are able to provide a continuum of different performance rates that depend on 
the relative values of the loss associated with false positive and negative state 
classifications. 
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Figure 4: Performance on simulated data sets. Classification error rates for (a) 
non-breakpoint locations, (b-c) false positive and negative transition. Classifi- 
cation (-I-) Viterbi. (a) Marginal Loss, (o) Markov Loss {Cfpt = 1, Cppc 
variable), (x) Cppc — 1 and Cppp variable. 



Under the marginal loss function, as the relative magnitude of the loss asso- 
ciated with a false positive to a false negative call is increased, the false positive 
rate (% FPC) decreases, whilst the number of false negatives (% FNC) increases. 
This is expected behaviour since the relative penalty associated with false posi- 
tive errors is increasing. However, we note that the Viterbi prediction is able to 
produce, at the same rate of false positive calls, far fewer false positive transi- 
tions (% FPT) than the marginal loss predictions. This is not unexpected since 
the global loss function, which the Viterbi prediction assumes, favours predic- 
tions that are structurally correct on a global and local scale. In contrast, the 
marginal loss function is ignorant of any structure and is more likely to produce 
state sequence predictions that contain small or even single location segments, 
thus contributing to an inflation in the rate of false positive transition errors as 
we saw in Figure [31 

5.1.4 Average performance using the Markov loss functions 

Figure [H shows the average classification performance using the Markov Loss 
function on the simulated datasets for fixed Cppp = 1 with Cppc varying and 
for fixed Cppc = 1 with CppT varied. For fixed Cppp — 1, we observe that 
the predicted sequences using the marginal and Markov loss functions achieve 
similar false positive and false negative call rates based on the number of loca- 
tions that are incorrectly called. However, at the same false positive call rate, 
the method based on the Markov loss function gives fewer false positive and 
negative transitions. 

When Cppc = 1 and CppT is varied, the performance of the Viterbi and 
marginal predictions are uniquely specified but we are able to control the rates 
of false transition error with the Markov loss function. Whilst there was rela- 
tively little change in the false positive transition rates when Cppc was varied, 
alterations to Cppp have a more pronounced effect on the numbers of false 
positive transitions. This is expected since Cppx is the loss associated with 
making false positive transition errors whilst Cppc primarily penalises errors 
in the classification of individual states. 

One point of interest is that, at an FPC rate of approximately 2.2%, we notice 
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that an appropriate choice of Cfpt allows the predictions under the Markov 
Loss function to achieve similar performance to the Viterbi predictions. In 
contrast, it is not possible to access this level of performance using the marginal 
loss function no matter what changes are made to the loss parameters. This 
shows empirically that predictions, under the Markov loss function, do indeed 
allow us to access performance levels that span the range between that of the 
marginal loss and global loss functions. 

5.2 Flexible CNV calling using SNP genotyping data 

We compared CNV analysis using QuantiSNP using the original Viterbi al- 
gorithm and a modified version using the proposed method based on Markov 
Loss functions. We used SNP genotyping data, acquired from the lUumina Hu- 
manHap650Y SNP genotyping array, for the human indivi dual NA15510 which 
has been previously well-studied in the CNV literature [Korbel et al.l . 12007 : 



Scherer et al.l . |2007| . As there are no gold standard, fully characterised human 
genomes, in terms of CNV content, we measured performance by comparing 
CNV calls from both methods studied with CNV loci stored on the Database 



of Genomic Variants (DGV) [lafrate et al.l . l2004l | . CNV calls were not filtered 



by thresholding via the Bayes Factor uncertainty measure in QuantiSNP and 
performance was measured solely on the predicted state sequence. In order to 
apply the Markov loss function, we extracted marginal switching probability es- 
timates -Kixi^i+i) from QuantiSNP and we explored a range of values for Cppc 
and Cfpt as before. 

QuantiSNP, with the original Viterbi copy number sequence predictions, 
produced 65 CNV calls of which 35 could be mapped on to CNV loci in the 
DGV as shown in Figure [nja). With the Markov loss function, we were able 
to access a wider spectrum of performance by modifying the loss parameters 
as shown in Figure El^b, c). What is of particular interest is that the predicted 
copy number sequences returned by the Viterbi method does not necessarily 
contain all the possible CNVs. For example, with the following loss parameters 
Cfpt = 0.1 and Cppc = Ij the Markov loss method produces 215 CNV calls 
with 81 overlapping a CNV locus in the DGV. This suggests that there maybe 
a further 46 CNVs contained in the sample that were not identified from the 
Viterbi prediction but are accessible to the Markov loss-based method, if one is 
prepared to accept higher levels of false CNV calls. 

The key point to emphasise here is that the use of the Markov loss functions 
gives a much more flexible approach for CNV data analysis and enables the 
experimenter to analyse their data subject to their own tolerances for the num- 
ber of false positives and negatives not that imposed on them by the algorithm 
developer. 

6 Discussion 

Hidden Markov Models have a rich history dating back over 30 years in signal 
processing, finance and more recently genomics. Despite the enormous breadth 
of applications there are only two current approaches in the literature to calling 
the hidden state sequence, the Viterbi-based (MAP) or marginal state predic- 
tions. The Viterbi prediction lacks flexibility whilst the marginal predictions 
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Figure 5: CNV analysis of individual NA15510 using QuantiSNP. (a) Number 
of CNV calls and matches to the Database of Genomic Variants (DGV) from 
the Viterbi prediction (+) and the Markov Loss prediction (x) for different loss 
parameters. The number of CNV calls using the Markov loss function based 
predictions when (b) as a function of the false positive call rate and (c) as a 
function of the false positive transition rate. The circled lines (o) indicate the 
total number of calls made whilst the dotted line (•) indicates the corresponding 
number of calls which match an entry in the Database of Genomic Variants. 



seem inappropriate for segmentation applications in which Hidden Markov mod- 
els are frequently applied. Here, we have presented a state sequence prediction 
method that uses a class of Markov loss functions which is more appropriate loss 
function for the segmentation and classification of linear sequence data using 
Hidden Markov models. The method is simple but we show that in applications, 
such as CNV calling, it is more intuitive to consider the correct classification of 
state transitions rather than the entirety of the state sequence or of individual 
states. In these scenarios the Markov loss function can be more suitable than 
the global or marginal loss functions that are often used in real applications. 

The calculation of the posterior expected loss with respect to a Markov loss 
function was shown to have a simple form and a dynamic programming algo- 
rithm was provided to compute the state sequence with the minimum expected 
loss. Although, the emphasis was on the Hidden Markov model, this method 
can be applied to any statistical model for the segmentation and classification of 
linear sequence data that can provide estimates of the marginal state transition 
probability 7r(a;i_i+i). Therefore it can be used to augment, without modifi- 
cation, many existing statistical methods for analyzing se quence data such as 
those based on semi-Markov model s, change point methods iFearnhead and Liu , 
|2007 | or product partition models Barry and Hartiganl . ll992 |. Whilst it is rela- 
tively simple addition the application of this method could greatly enhance the 
adaptability of many existing statistical algorithms transferring power to the 
experimenter to allow them to assign losses to various error types relevant to 
their own study. 

Alternatives to maximum a posteriori estimators has been used in Hid- 
den M arkov model-based DNA segmentation application s by Aston and Martini 
|2007f and for clustering applic a tions by tBindcr [1978] and lLau and GreenI 2007j . 
In particular, iLau and GreenI |2007l | have argued that maximum a posteriori 
clustering estimates have no objective status in Bayesian theory as an "opti- 
mal" estimate of the data partition. Posterior modes can be unrepresentative of 
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the centre of mass for the posterior distribution in high dimensional problems 
and point estimators should only be obtained after specifying some loss func- 
tion and then computing the optimal parameter estimate that minimises the 
posterior expected loss. 

Throughout this paper we have not explicitly stated how the loss values 
should be selected. This is purposeful because the selection of the costs asso- 
ciated with various error types is entirely study- dependent and it is only the 
data analyst who can develop the criteria for the design of the loss matrix. For 
example, in CNV analysis, costs might be related to tangible quantities such 
as the financial, time and man power requirements for follow-up studies and 
validation taken upon the predictions. 

It is also of further research interest to characterise the effect on predictions 
when only an approximation of the statistical model is available. Furthermore, 
in some applications there maybe some utility in combining of Markov loss func- 
tions on the hidden state sequence x and loss functions on the model parameters 
0. The Markov loss function introduced here focuses on costs associated with 
classification errors of the hidden state sequence and assumes that the model 
parameters are in some sense nuisance variables. There are applications where 
both the state sequence and model parameters maybe of interest; for example, 
the transition matrix may have some interpretation for a given application and 
a loss function maybe given on 9. In these instances it maybe necessary to 
derive optimal joint predictions (x, 9) under the appropriate loss functions. 
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